Telecom Churn Rate Dataset - Exploratory and Descriptive Analysis

Author
Affiliation

Adeline&Annick

Junior Data Analyst

Published

June 30, 2025

This notebook is focused on the exploratory and descriptive analysis of the cleaned version of the Telecom Churn Rate Dataset.To begin with, we import all the necessary Python libraries required for data handling, analysis, and processing.

Code
# import libraries
import pandas as pd
import numpy as np
import os
import plotly.express as px

1 Define and create Paths

We define the folder structure to organize our project. This includes:

  • data/raw for raw data files
  • data/processed for cleaned and transformed datasets
  • results for any output files such as plots or models
  • docs for documentation or rendered outputs (e.g., HTML, PDF)

We also ensure these directories are created if they don’t already exist.

2 Loading the Dataset

Now we load the raw dataset (Telecom Churn Rate Dataset) from the raw directory using pandas.read_excel().

Missing values are identified using the placeholder ?, which we convert to NaN.

Code
telecom_filename = os.path.join(processed_dir,'cleaned.xlsx')
telecom_df = pd.read_excel(telecom_filename)
telecom_df.head(10)
Telecom Churn Rate Dataset
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges numAdminTickets numTechTickets Churn
0 7590-vhveg female 0 yes no 1 no no phone service digital subscriber line no ... no no month-to-month yes electronic check 29.85 29.85 0 0 no
1 5575-gnvde male 0 no no 34 yes no digital subscriber line yes ... no no one year no mailed check 56.95 1889.50 0 0 no
2 3668-qpybk male 0 no no 2 yes no digital subscriber line yes ... no no month-to-month yes mailed check 53.85 108.15 0 0 yes
3 7795-cfocw male 0 no no 45 no no phone service digital subscriber line yes ... no no one year no bank transfer (automatic) 42.30 1840.75 0 3 no
4 9237-hqitu female 0 no no 2 yes no fiber optic no ... no no month-to-month yes electronic check 70.70 151.65 0 0 yes
5 9305-cdskc female 0 no no 8 yes yes fiber optic no ... yes yes month-to-month yes electronic check 99.65 820.50 0 0 yes
6 1452-kiovk male 0 no yes 22 yes yes fiber optic no ... yes no month-to-month yes credit card (automatic) 89.10 1949.40 0 0 no
7 6713-okomc female 0 no no 10 no no phone service digital subscriber line yes ... no no month-to-month no mailed check 29.75 301.90 0 0 no
8 7892-pookp female 0 yes no 28 yes yes fiber optic no ... yes yes month-to-month yes electronic check 104.80 3046.05 0 2 yes
9 6388-tabgu male 0 no yes 62 yes no digital subscriber line yes ... no no one year no bank transfer (automatic) 56.15 3487.95 0 0 no

10 rows × 23 columns

Code
telecom_filename = os.path.join(processed_dir,'cleaned.xlsx')
telecom_df = pd.read_excel(telecom_filename)
telecom_df.head(10)

We use .shape to see the number of rows and columns in the dataset. This helps us understand the scale of the data.

The .info() method gives a concise summary of the dataset, including:

  • Column names and types
  • Number of non-null entries
  • Memory usage

This helps us assess the completeness of the data and identify potential issues.

Code
summary_df = pd.DataFrame({
    'Column': telecom_df.columns,
    'Data Type':telecom_df.dtypes.values,
    'Missing Values': telecom_df.isnull().sum().values
})
summary_df
Table 1: Overview of dataset columns, their data types, and the count of missing values in each column.
Column Data Type Missing Values
0 customerID object 0
1 gender object 0
2 SeniorCitizen int64 0
3 Partner object 0
4 Dependents object 0
5 tenure int64 0
6 PhoneService object 0
7 MultipleLines object 0
8 InternetService object 0
9 OnlineSecurity object 0
10 OnlineBackup object 0
11 DeviceProtection object 0
12 TechSupport object 0
13 StreamingTV object 0
14 StreamingMovies object 0
15 Contract object 0
16 PaperlessBilling object 0
17 PaymentMethod object 0
18 MonthlyCharges float64 0
19 TotalCharges float64 0
20 numAdminTickets int64 0
21 numTechTickets int64 0
22 Churn object 0

3 Summary statistics

Numerical variables

Code
telecom_df.describe()
Table 2: Summary statistics for numerical variables in the dataset, including count, mean, standard deviation, min, and quartile values.
SeniorCitizen tenure MonthlyCharges TotalCharges numAdminTickets numTechTickets
count 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000 7043.000000
mean 0.162147 32.371149 64.761692 2279.734304 0.515689 0.419566
std 0.368612 24.559481 30.090047 2266.794470 1.275299 1.250117
min 0.000000 0.000000 18.250000 0.000000 0.000000 0.000000
25% 0.000000 9.000000 35.500000 398.550000 0.000000 0.000000
50% 0.000000 29.000000 70.350000 1394.550000 0.000000 0.000000
75% 0.000000 55.000000 89.850000 3786.600000 0.000000 0.000000
max 1.000000 72.000000 118.750000 8684.800000 5.000000 9.000000

4 Categorical Variables

Code
telecom_df.describe(include='object')
Table 3: Summary statistics for categorical variables in the dataset, including count, mean, standard deviation, min, and quartile values.
customerID gender Partner Dependents PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod Churn
count 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043 7043
unique 7043 2 2 2 2 3 3 3 3 3 3 3 3 3 2 4 2
top 7590-vhveg male no no yes no fiber optic no no no no no no month-to-month yes electronic check no
freq 1 3555 3641 4933 6361 3390 3096 3498 3088 3095 3473 2810 2785 3875 4171 2365 5174
Code
telecom_df['gender'].value_counts(normalize=True).rename_axis('unique values').reset_index(name='proportion')
Table 4: Proportion of each category in the gender variable.
unique values proportion
0 male 0.504756
1 female 0.495244
Code
telecom_df['Partner'].value_counts(normalize=True).rename_axis('unique values').reset_index(name='proportion')
Table 5: Proportion of each category in the partner variable.
unique values proportion
0 no 0.516967
1 yes 0.483033
Code
telecom_df['Dependents'].value_counts(normalize=True).rename_axis('unique values').reset_index(name='proportion')
Table 6: Proportion of each category in the Dependents variable.
unique values proportion
0 no 0.700412
1 yes 0.299588
Code
telecom_df['PhoneService'].value_counts(normalize=True).rename_axis('unique values').reset_index(name='proportion')
Table 7: Proportion of each category in the PhoneService variable.
unique values proportion
0 yes 0.903166
1 no 0.096834
Code
telecom_df['MultipleLines'].value_counts(normalize=True).rename_axis('unique values').reset_index(name='proportion')
Table 8: Proportion of each category in the MultipleLines variable.
unique values proportion
0 no 0.481329
1 yes 0.421837
2 no phone service 0.096834
Code
telecom_df['InternetService'].value_counts(normalize=True).rename_axis('unique values').reset_index(name='proportion')
Table 9: Proportion of each category in the InternetServices variable.
unique values proportion
0 fiber optic 0.439585
1 digital subscriber line 0.343746
2 no 0.216669
Code
telecom_df['OnlineSecurity'].value_counts(normalize=True).rename_axis('unique values').reset_index(name='proportion')
Table 10: Proportion of each category in the OnlineSecurity variable.
unique values proportion
0 no 0.496663
1 yes 0.286668
2 no internet service 0.216669
Code
telecom_df['OnlineBackup'].value_counts(normalize=True).rename_axis('unique values').reset_index(name='proportion')
Table 11: Proportion of each category in the OnlineBackup variable.
unique values proportion
0 no 0.438450
1 yes 0.344881
2 no internet service 0.216669
Code
telecom_df['DeviceProtection'].value_counts(normalize=True).rename_axis('unique values').reset_index(name='proportion')
Table 12: Proportion of each category in the DeviceProtection variable.
unique values proportion
0 no 0.439443
1 yes 0.343888
2 no internet service 0.216669
Code
telecom_df['TechSupport'].value_counts(normalize=True).rename_axis('unique values').reset_index(name='proportion')
Table 13: Proportion of each category in the TechSupport variable.
unique values proportion
0 no 0.493114
1 yes 0.290217
2 no internet service 0.216669
Code
telecom_df['StreamingTV'].value_counts(normalize=True).rename_axis('unique values').reset_index(name='proportion')
Table 14: Proportion of each category in the StreamingTV variable.
unique values proportion
0 no 0.398978
1 yes 0.384353
2 no internet service 0.216669
Code
telecom_df['Contract'].value_counts(normalize=True).rename_axis('unique values').reset_index(name='proportion')
Table 15: Proportion of each category in the Contract variable.
unique values proportion
0 month-to-month 0.550192
1 two year 0.240664
2 one year 0.209144
Code
telecom_df['PaperlessBilling'].value_counts(normalize=True).rename_axis('unique values').reset_index(name='proportion')
Table 16: Proportion of each category in the PaperlessBilling variable.
unique values proportion
0 yes 0.592219
1 no 0.407781
Code
telecom_df['PaymentMethod'].value_counts(normalize=True).rename_axis('unique values').reset_index(name='proportion')
Table 17: Proportion of each category in the PaymentMethod variable.
unique values proportion
0 electronic check 0.335794
1 mailed check 0.228880
2 bank transfer (automatic) 0.219225
3 credit card (automatic) 0.216101

5 Understanding the Churn Column

In our dataset, there is a Churn column that indicates whether a customer has left the telecom service or not:

  • Yes – The customer has churned, meaning they have canceled their service or switched to another provider.
  • No – The customer is still active, meaning they have * the telecom company.

6 Churn rate

What is Churn Rate?

In the telecom industry, Churn Rate refers to the percentage of customers who stop using the company’s services during a specific period of time.

It is a key performance indicator (KPI) used to measure customer retention and loportant?

High churn rates can signal:

  • Poor customer satisfaction
  • High competition
  • Issues with service quality or pricing

Reducing churn is crucial for maintaining revenue and customer base.

not churned churned
0 no 0.73463
1 yes 0.26537

After analyzing the dataset, we observe the following churn rate:

  • 26.5% of customers have churned (Churn = Yes)

  • 73.5% of customers have remained loyal (Churn = No)

  • Out of every 100 customers, about 27 leave the telecom service, while 73 continue using it.

  • A churn rate of 26.5% is relatively high in the telecom industry and could be a sign of underlying issues such as:

    • Poor customer service
    • Uncompetitive pricing
    • Limited network coverage
    • Inadequate value propositions compared to competitors

Understanding churn rate is essential because:

  • Customer acquisition costs are high. It’s often more expensive to gain a new customer than to retain an existing one.
  • A high churn rate can lead to revenue loss and affect the company’s growth and sustainability.
  • Identifying the factors that lead to churn enables the company to take targeted actions to reduce it.

This insight sets the foundation for deeper analysis, such as: - Segmenting churned users - Identifying top predictors of churn - Designing customer retention strategies

7 Gender Distribution

C:\Users\ADELINE PC\AppData\Local\Temp\ipykernel_13196\2785251253.py:50: DeprecationWarning:


Support for the 'engine' argument is deprecated and will be removed after September 2025.
Kaleido will be the only supported engine at that time.


C:\Users\ADELINE PC\AppData\Local\Temp\ipykernel_13196\2785251253.py:52: DeprecationWarning:


Support for the 'engine' argument is deprecated and will be removed after September 2025.
Kaleido will be the only supported engine at that time.

The dataset includes customer information classified by gender. Here’s the breakdown:

  • Female: 3,488 customers

  • Male: 3,555 customers

  • The gender distribution is nearly balanced, with 51.9% male and 48.1% female.

  • Since the difference is small, gender is unlikely to be a strong biasing factor in this dataset.

8 Payment Method

The dataset includes information about the preferred payment methods used by customers. The distribution is as follows:

  • Electronic Check: 34%
  • Mailed Check: 23%
  • Bank Transfer (Automatic): 22%
  • Credit Card (Automatic): 22%
Code
telecom_df_payment_method = telecom_df['PaymentMethod'].value_counts(normalize=True).round(2).reset_index(name='payment_method_rate')
telecom_df_payment_method
PaymentMethod payment_method_rate
0 electronic check 0.34
1 mailed check 0.23
2 bank transfer (automatic) 0.22
3 credit card (automatic) 0.22
C:\Users\ADELINE PC\AppData\Local\Temp\ipykernel_13196\751501223.py:48: DeprecationWarning:


Support for the 'engine' argument is deprecated and will be removed after September 2025.
Kaleido will be the only supported engine at that time.


C:\Users\ADELINE PC\AppData\Local\Temp\ipykernel_13196\751501223.py:50: DeprecationWarning:


Support for the 'engine' argument is deprecated and will be removed after September 2025.
Kaleido will be the only supported engine at that time.

  • Electronic Check is the most common payment method, used by over one-third of the customers.
  • Automatic payment methods (bank transfer and credit card) together account for 44%, indicating that a significant portion of customers prefer convenience and automation.
  • The presence of manual methods like mailed checks (23%) suggests that some customers may still prefer traditional payment options.

Understanding the distribution of payment methods is important for: - Customer segmentation - Identifying churn risks (e.g., manual payments may be associated with higher churn) - Designing targeted financial services or offers

9 Paperless Billing

The dataset tracks whether customers opted for paperless billing or not. The distribution is as follows:

  • Yes (Paperless Billing): 59%
  • No (Paper Billing): 41%
Code
telecom_df_billing= telecom_df['PaperlessBilling'].value_counts(normalize=True).reset_index(name='paperless_rate').round(2)
telecom_df_billing
PaperlessBilling paperless_rate
0 yes 0.59
1 no 0.41
  • A majority of customers (59%) prefer receiving their bills electronically, which suggests a strong adoption of digital billing methods.
  • However, 41% still receive paper bills, indicating that a significant portion of customers may:
    • Be less tech-savvy
    • Prefer traditional methods
    • Have limited access to digital platforms

Understanding billing preferences can help the company: - Promote environmentally friendly practices - Reduce mailing costs

10 Number of Contracts

Customers in the dataset are categorized based on the type of contract they hold. Here’s the breakdown:

  • Month-to-Month: 3,875 customers
  • One Year: 1,473 customers
  • Two Year: 1,695 customers
Code
telecom_df_contract = telecom_df.groupby('Contract').size().reset_index(name='Number_of_contract')
telecom_df_contract['Number_of_contract'] = telecom_df_contract['Number_of_contract'].round(2)
telecom_df_contract
Contract Number_of_contract
0 month-to-month 3875
1 one year 1473
2 two year 1695
  • The majority of customers (~57%) are on month-to-month contracts, which are typically more flexible and easier to cancel.
  • Long-term contracts (one-year and two-year combined) account for ~43% of the customer base.
  • Customers on month-to-month contracts may be more likely to churn, as they face fewer barriers to switching providers.
  • In contrast, those with longer contracts may demonstrate higher retention due to commitment incentives or penalties for early termination.

11 Statistical test

In this analysis, we examined whether MonthlyCharges vary significantly among customers with different contract types. The three categories of contracts analyzed were Month-to-month, One year, and Two year. Understanding how monthly charges differ across these groups helps uncover potential pricing strategies and customer behavior patterns associated with each contract type.

Results Interpretation

p-value Significant? Conclusion
< 0.05 Yes Reject H₀ – No Significant difference in MonthlyCharges across contract types
≥ 0.05 No Fail to Reject H₀ – significant difference in MonthlyCharges across contract types
C:\Users\ADELINE PC\AppData\Local\Temp\ipykernel_13196\2235245424.py:16: FutureWarning:



Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.


C:\Users\ADELINE PC\AppData\Local\Temp\ipykernel_13196\2235245424.py:29: FutureWarning:



Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

Shapiro-Wilk Normality Test p-values:
Month-to-month: p = 0.0000
One year: p = 0.0000
Two year: p = 0.0000

Levene’s Test p-value: 0.0000

Using Welch’s ANOVA (unequal variances)
                sum_sq      df         F        PR(>F)
Contract  3.182336e+04     2.0  17.67295  2.207693e-08
Residual  6.338399e+06  7040.0       NaN           NaN
ANOVA F-statistic: 17.6730
ANOVA p-value: 0.0000
Reject H₀: No Significant difference in MonthlyCharges across contract types.

12 Conclusion

The statistical analysis confirms that contract type significantly influences monthly charges. Customers with Month-to-month contracts pay more on average, which may be due to:

-Lack of discounts available in long-term plans, -Flexibility priced at a premium,

13 Recommendation:

Telecom providers might use this insight to design targeted retention strategies—such as offering discounts or bundling services—to encourage month-to-month customers to switch to long-term contracts, potentially improving customer lifetime value.